A hybrid recommendation algorithm based on user nearest neighbor model

In the realm of e-commerce, personalized recommendations are a crucial component in enhancing user experience and optimizing sales efficiency. To address the inherent sparsity challenge prevalent in collaborative filtering algorithms within personalized recommendation systems, we propose a novel hybrid e-commerce recommendation algorithm based on the User-Nearest-Neighbor model. By integrating the user nearest neighbor model with other recommendation algorithms, this approach effectively mitigates data sparsity and facilitates a more nuanced understanding of the user-product relationship, consequently elevating recommendation quality and enhancing user experience. Taking into account considerations such as data scale and recommendation performance, we conducted experiments utilizing the Spark distributed platform. Empirical findings demonstrate the superiority of our hybrid algorithm over standalone collaborative filtering algorithms across various recommendation indicators.


Introduction
The rapid expansion of the Internet has made e-commerce an indispensable part of contemporary business operations. Within the e-commerce domain, recommender systems (RS) [1-5] play a pivotal role in increasing user satisfaction, boosting sales, and strengthening platform competitiveness [6]. RS have achieved substantial success across diverse business settings, exemplified by globally recognized platforms such as Amazon, Netflix, TripAdvisor, and Last.fm [3]. However, in the era of big data, building high-quality personalized e-commerce recommendation systems faces formidable hurdles, particularly in handling vast and sparsely populated user-item interaction datasets.
Collaborative filtering [7-10] is a popular personalized recommendation technology that relies on the similarity between users or items to generate recommendations. In recent years, the optimization strategies for collaborative filtering models have been continuously enriched, as shown in the literature [11-13]. Within collaborative filtering models, matrix factorization (MF) [14], multilayer perceptron (MLP) [15], and neural collaborative filtering (NCF) [16] are the most classic and widely used algorithms. However, these algorithms face significant challenges due to the high dimensionality and sparsity of the user-item interaction matrix [3,17]. In real-world scenarios, users typically interact with less than 1% of the total number of items, so the user-item interaction matrix is both high-dimensional and sparse. As a result, traditional collaborative filtering models struggle to capture implicit user and item features effectively, and recommendation quality decreases.
To tackle the challenge posed by the sparsity of the user-item interaction matrix, we introduce a hybrid recommendation algorithm [18]. The fundamental concept of this algorithm is to optimize the user-item interaction matrix and alleviate its data sparsity. The methodology is as follows: first, the missing values within the original user-item interaction matrix are predicted using the user nearest neighbor model. Next, the optimized user-item interaction matrix serves as the input for the collaborative filtering model. Finally, a personalized recommendation list is generated based on this processed data.
This method alleviates the sparsity of the original user-item interaction matrix and reduces the impact of matrix sparsity on the collaborative filtering model. Additionally, the distributed computing capabilities of the Apache Spark platform allow high-dimensional, large-scale matrices to be computed efficiently.

User-item interaction matrix
The user-item interaction matrix is a widely used data structure in the field of personalized recommendation, employed to describe the interactions and associations between users and items. It is typically represented as a two-dimensional matrix in which rows correspond to users, columns correspond to items, and each element records the interaction between the corresponding user and item, as illustrated in Table 1.
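As an illustration only, the matrix can be held sparsely as a dictionary of per-user item frequencies. The sketch below assumes that each raw record is a (user, item) pair and that repeated pairs accumulate into an interaction frequency; the function names and the toy records are hypothetical and are not taken from the datasets used in this paper.

```python
from collections import defaultdict


def build_interaction_matrix(records):
    """Aggregate (user, item) records into {user: {item: frequency}}."""
    matrix = defaultdict(lambda: defaultdict(int))
    for user, item in records:
        matrix[user][item] += 1
    # Convert to plain dicts so missing keys are not created on lookup.
    return {user: dict(items) for user, items in matrix.items()}


def sparsity(matrix, n_users, n_items):
    """Fraction of user-item cells that hold no interaction."""
    filled = sum(len(items) for items in matrix.values())
    return 1.0 - filled / (n_users * n_items)


# Hypothetical toy log: u1 interacts with i1 twice and with i3 once.
records = [("u1", "i1"), ("u1", "i1"), ("u1", "i3"), ("u2", "i2"), ("u3", "i3")]
matrix = build_interaction_matrix(records)
toy_sparsity = sparsity(matrix, n_users=3, n_items=3)
```

Only non-empty cells are stored, which matters when, as in the datasets described later, more than 99% of the cells are empty.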

User nearest neighbor model
The basic principle of the User Nearest Neighbor (UNN) model is to make predictions based on the behavior of the users most similar to the target user. The core idea is to calculate the similarity between users from their interaction frequencies with products and from the sets of products they have interacted with. For each user u, the nearest neighbor user v is identified. Then, for the products that v has interacted with but u has not, the interaction frequency of u is predicted from the interaction frequency of v with those products and the similarity between the two users. The specific calculation formula is as follows:
where I_u and I_v represent the sets of products interacted with by users u and v, respectively; Sim(u, v) represents the similarity between users u and v; and R_ui and R_vi represent the interaction frequencies of users u and v for product i, respectively.
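The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact formula of this paper: the predicted frequency is assumed to be Sim(u, v) multiplied by R_vi for every product i in I_v but not in I_u, and plain cosine similarity stands in for the similarity measure. All function names are hypothetical.

```python
import math


def cosine_similarity(ru, rv):
    """Cosine similarity between two sparse frequency dicts (placeholder measure)."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    dot = sum(ru[i] * rv[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in ru.values()))
    norm_v = math.sqrt(sum(x * x for x in rv.values()))
    return dot / (norm_u * norm_v)


def nearest_neighbor(matrix, u, sim=cosine_similarity):
    """Return (v, Sim(u, v)) for the most similar other user, or (None, 0.0)."""
    best_v, best_s = None, 0.0
    for v in matrix:
        if v == u:
            continue
        s = sim(matrix[u], matrix[v])
        if s > best_s:
            best_v, best_s = v, s
    return best_v, best_s


def predict_missing(matrix, u, sim=cosine_similarity):
    """Assumed rule: predicted R_ui = Sim(u, v) * R_vi for i in I_v but not in I_u."""
    v, s = nearest_neighbor(matrix, u, sim)
    if v is None:
        return {}
    return {i: s * r for i, r in matrix[v].items() if i not in matrix[u]}


# Hypothetical toy matrix: v shares items a and b with u and also has item c.
toy = {"u": {"a": 2, "b": 1}, "v": {"a": 2, "b": 1, "c": 4}, "w": {"d": 3}}
neighbor, similarity = nearest_neighbor(toy, "u")
predicted = predict_missing(toy, "u")
```

Here user w shares no items with u, so v is selected and only product c receives a predicted value for u.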

Similarity metric
Traditional methods for calculating user similarity include Euclidean distance [25], cosine similarity [1,26], and the Pearson correlation coefficient [17,26]. The specific formulas are as follows:

• Pearson correlation coefficient
where I_uv represents the set of items with which both users u and v have interacted; R_ui and R_vi indicate the interaction frequencies of users u and v for item i, respectively; and R̄_u and R̄_v represent the average interaction frequencies of users u and v over the items they have interacted with.
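For reference, the standard textbook form of the Pearson correlation coefficient, written with the symbols defined above, is given below. It is offered as the usual definition under the assumption that the sums run over the co-interacted set I_uv; it is not a verbatim copy of this paper's equation.

```latex
\mathrm{Sim}(u,v) =
\frac{\sum_{i \in I_{uv}} \left(R_{ui} - \bar{R}_u\right)\left(R_{vi} - \bar{R}_v\right)}
     {\sqrt{\sum_{i \in I_{uv}} \left(R_{ui} - \bar{R}_u\right)^{2}}\;
      \sqrt{\sum_{i \in I_{uv}} \left(R_{vi} - \bar{R}_v\right)^{2}}}
```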
Traditional similarity measurement relies on the behavioral data of the items that users u and v have both interacted with. However, due to the inherent sparsity of the user-item interaction matrix, the number of items that two users have in common is usually very limited, resulting in a large estimation error in the calculated user similarity. For example, if two users share only one item but their interaction frequencies for it are the same, the traditional similarity measurement will assign them a high similarity, which is unreasonable.
To tackle this issue, our study introduces a weighted similarity measurement method. The core principles of this method are twofold: first, the more items two users have both interacted with, the greater their similarity; second, within the set of co-interacted items, items with lower popularity reflect user differences more strongly. This article uses the Euclidean distance as an example to illustrate this concept. The specific calculation formula for weighted similarity is as follows:
where pop_i represents the popularity value of product i; U represents the set of all users; U_i represents the set of users who have interacted with product i; and R_ui represents the interaction frequency of user u with product i.
By considering both the number of co-interactions and item popularity weights, this weighted similarity measurement method reduces estimation errors and offers a more precise description of user similarity.
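The two principles can be illustrated with a small sketch. Every concrete choice below is an assumption made for illustration rather than the weighting used in this paper: popularity is taken as pop_i = |U_i| / |U|, each squared difference is weighted by 1 - 0.5 * pop_i so that less popular items count more, and the Euclidean-based similarity 1 / (1 + distance) is scaled by the share of co-interacted items among all items either user touched.

```python
import math


def popularity(matrix):
    """Assumed definition: pop_i = |U_i| / |U|."""
    n_users = len(matrix)
    counts = {}
    for items in matrix.values():
        for i in items:
            counts[i] = counts.get(i, 0) + 1
    return {i: c / n_users for i, c in counts.items()}


def weighted_similarity(ru, rv, pop):
    """Hypothetical weighted Euclidean similarity between two users."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    # Principle 2: less popular items receive a larger weight.
    dist = math.sqrt(
        sum((1.0 - 0.5 * pop[i]) * (ru[i] - rv[i]) ** 2 for i in common)
    )
    # Principle 1: more co-interacted items raise the similarity.
    overlap = len(common) / len(set(ru) | set(rv))
    return overlap / (1.0 + dist)


# Two users who share a single item with identical frequency.
toy = {"u": {"a": 1, "b": 2, "c": 1}, "v": {"a": 1, "d": 3, "e": 2}}
pop = popularity(toy)
sim_uv = weighted_similarity(toy["u"], toy["v"], pop)
sim_uu = weighted_similarity(toy["u"], toy["u"], pop)
```

In this toy case an unweighted Euclidean similarity of the form 1 / (1 + distance) would return 1 for u and v, whereas the sketch returns 0.2 because only one of the five items involved is shared.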

Alternating least squares method
Alternating Least Squares (ALS) [27] is a widely used MF optimization algorithm. Taking implicit feedback [28,29] data as an example, the specific steps of the ALS algorithm are summarized as follows:
• Randomly initialize the user matrix P and the item matrix Q.
• Treat the user matrix P as a fixed value and update the item matrix Q.
• Treat the item matrix Q as a fixed value and update the user matrix P.
• Continue alternating iterations until the loss function converges or the maximum number of iterations is reached.
where n and m represent the numbers of users and items, respectively; p_u and q_i denote the feature vectors of user u and item i, respectively; U and I represent the sets of users and items, respectively; C_ui represents the confidence of user u for item i; y_ui indicates whether user u prefers item i; λ denotes the regularization coefficient; and E represents the identity matrix. The specific formulas for C_ui and y_ui are as follows:
where α represents the confidence coefficient and R_ui represents the interaction frequency of user u for product i.
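For orientation, the widely used implicit-feedback ALS formulation can be written with the symbols above as follows. This is the standard form from the implicit-feedback literature and is assumed, not confirmed, to match the equations of this paper. The diagonal matrices C^u (m × m) and C^i (n × n), the preference vectors y_u and y_i, and the use of λ for the regularization coefficient are notation introduced here for the sketch.

```latex
\begin{aligned}
\mathcal{L} &= \sum_{u \in U} \sum_{i \in I} C_{ui}\left(y_{ui} - p_u^{\top} q_i\right)^{2}
  + \lambda \left( \sum_{u \in U} \lVert p_u \rVert^{2} + \sum_{i \in I} \lVert q_i \rVert^{2} \right) \\
p_u &= \left(Q^{\top} C^{u} Q + \lambda E\right)^{-1} Q^{\top} C^{u}\, y_u \\
q_i &= \left(P^{\top} C^{i} P + \lambda E\right)^{-1} P^{\top} C^{i}\, y_i \\
C_{ui} &= 1 + \alpha R_{ui}, \qquad
y_{ui} = \begin{cases} 1, & R_{ui} > 0 \\ 0, & R_{ui} = 0 \end{cases}
\end{aligned}
```

Each update is a regularized least-squares solve with the other factor matrix held fixed, which is why the two steps can be alternated until the loss stops decreasing.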

Multilayer perceptron
Multilayer Perceptron (MLP) [15] is a feedforward artificial neural network and a fundamental component of deep learning architectures. It typically comprises an embedding layer, one or more hidden layers, and an output layer. The training process of the MLP model unfolds as follows:
• Embedding layer: Map users and products to a low-dimensional vector space through an embedding matrix.
where p_u and q_i represent the low-dimensional feature vectors of user u and product i, respectively; P_e and Q_e both represent embedding matrices; and u_One-Hot and i_One-Hot represent the one-hot encodings of user u and product i, respectively.
• Hidden layer: Concatenate the embedded vectors into one long vector, and learn nonlinear feature interactions through multiple fully connected layers.
where φ represents the hidden layer activation function; || represents the concatenation operation; W and b represent the weight and bias of the hidden layer, respectively; and L represents the number of hidden layers.
• Output layer: Predict the user's rating or preference for the product.
where σ represents the output layer activation function, and W_out and b_out represent the weight and bias of the output layer, respectively.
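The three layers above can be traced in a small forward-pass sketch. It assumes ReLU as the hidden activation φ, a sigmoid as the output activation σ, and a single scalar output; none of these choices is stated in this section, and the tiny weights are hypothetical. Looking up row u of an embedding table is equivalent to multiplying the embedding matrix by the one-hot encoding of u.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mlp_predict(u, i, P_e, Q_e, hidden, W_out, b_out):
    """Forward pass: embedding lookup, concatenation, hidden layers, output."""
    p_u = P_e[u]  # same result as embedding matrix times one-hot(u)
    q_i = Q_e[i]
    h = p_u + q_i  # list concatenation plays the role of ||
    for W, b in hidden:  # one (W, b) pair per hidden layer
        h = relu([z + bj for z, bj in zip(matvec(W, h), b)])
    z = sum(w * hj for w, hj in zip(W_out, h)) + b_out
    return sigmoid(z)


# Hypothetical parameters: 2 users, 1 item, embedding size 2, one hidden layer.
P_e = [[1.0, 0.0], [0.0, 1.0]]
Q_e = [[0.5, 0.5]]
hidden = [([[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]], [0.0, 0.0])]
score = mlp_predict(0, 0, P_e, Q_e, hidden, W_out=[1.0, 1.0], b_out=-2.0)
```

With these toy weights the pre-activation of the output unit is exactly zero, so the predicted preference is 0.5.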

Neural collaborative filtering
Neural collaborative filtering (NCF) [16] is a recommendation algorithm that integrates deep learning with collaborative filtering principles. It leverages neural networks to capture the nonlinear relationship between users and products. The NCF model combines the strengths of MLP and generalized matrix factorization (GMF), allowing it to achieve superior recommendation outcomes in high-dimensional sparse data scenarios. The specific training methodology of the NCF model is depicted in Fig. 1:
• Embedding layer: Map users and products to a low-dimensional vector space through the embedding matrix.
p_u = P_e · u_One-Hot
where p_u^G and q_i^G represent the low-dimensional feature vectors of user u and item i based on the GMF model, respectively; P_e^G and Q_e^G both represent embedding matrices based on the GMF model; p_u^M and q_i^M represent the low-dimensional feature vectors of user u and item i based on the MLP model, respectively; P_e^M and Q_e^M both represent embedding matrices based on the MLP model; and u_One-Hot and i_One-Hot represent the one-hot encodings of user u and item i, respectively.
• Hidden layer: Capture the complex interactive relationship between users and items through multiple nonlinear transformations and feature extraction.
where ⊙ represents element-wise multiplication; h_GMF and h_MLP represent the hidden layer embedding vectors obtained by the GMF model and the MLP model, respectively; φ_L represents the activation function of the Lth hidden layer of the MLP model; W_L and b_L represent the weight and bias of the Lth hidden layer of the MLP model, respectively; h_concat represents the embedded vector after concatenation; and || represents the concatenation operation.
• Output layer: Predict the user's evaluation or preference for the product.
where σ represents the output layer activation function, and W_out and b_out represent the weight and bias of the output layer, respectively.
The ALS algorithm treats all non-interacted products as negative examples during training. However, in reality, due to data sparsity, there is a significant imbalance between positive and negative examples. Even with negative sampling, there can still be a substantial number of false negative examples. Additionally, while the MLP and NCF algorithms excel at capturing the nonlinear relationships between users and products, this strength can lead to overfitting when dealing with sparse matrices. Therefore, optimizing the user-product interaction matrix becomes crucial in mitigating these challenges.
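For orientation, the NCF forward pass is commonly written as follows, using the symbols defined for the embedding, hidden, and output layers of the NCF model. This is the standard formulation from the NCF literature and is assumed, not confirmed, to match the equations of this paper; the symbol ŷ_ui for the predicted preference is introduced here for the sketch.

```latex
\begin{aligned}
p_u^{G} &= P_e^{G}\, u^{\text{One-Hot}}, & q_i^{G} &= Q_e^{G}\, i^{\text{One-Hot}}, \\
p_u^{M} &= P_e^{M}\, u^{\text{One-Hot}}, & q_i^{M} &= Q_e^{M}\, i^{\text{One-Hot}}, \\
h_{\mathrm{GMF}} &= p_u^{G} \odot q_i^{G}, \\
h_{\mathrm{MLP}} &= \varphi_L\!\left(W_L\left(\cdots \varphi_1\!\left(W_1\left(p_u^{M} \,\Vert\, q_i^{M}\right) + b_1\right)\cdots\right) + b_L\right), \\
h_{\mathrm{concat}} &= h_{\mathrm{GMF}} \,\Vert\, h_{\mathrm{MLP}}, \\
\hat{y}_{ui} &= \sigma\!\left(W_{\mathrm{out}}\, h_{\mathrm{concat}} + b_{\mathrm{out}}\right)
\end{aligned}
```

The GMF branch keeps a linear, element-wise interaction between the two embeddings, while the MLP branch learns a nonlinear one; concatenating both before the output layer lets the model use whichever signal is more informative.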
This paper integrates the UNN model with the ALS, MLP, and NCF algorithms, respectively. Optimizing the user-item interaction matrix addresses the limitations of the individual recommendation algorithms in dealing with data sparsity, thereby enhancing recommendation quality.

Experimental analysis
This section uses the Spark platform to evaluate the recommendation quality of the hybrid e-commerce recommendation algorithm based on the UNN model proposed in this paper. Through experiments, its performance is compared with various collaborative filtering algorithms. The results demonstrate that the hybrid recommendation algorithm outperforms the single algorithms across multiple indicators.

Data set
Dataset 1: This dataset is sourced from the Alibaba mobile e-commerce platform and provided by Alibaba Cloud Tianchi Laboratory. It comprises 834 users and 350,889 distinct items, with 1,048,575 records of user-item interactions, resulting in a data sparsity of 99.865%. The interactions between users and items in this dataset include browsing, bookmarking, adding to cart, and purchasing. Dataset 1 link: Dataset 1 link.
Dataset 2: This dataset is derived from a Taobao user behavior dataset provided by Alibaba, designed for research on implicit feedback recommendation problems. It includes 987,994 users and 4,162,024 distinct items, with 100,150,807 records of user-item interactions. For this experiment, a subset of the data was selected, comprising 1,000 users and 48,488 distinct items, with 73,980 records of user-item interactions, resulting in a data sparsity of 99.887%. The interactions between users and items in this dataset include browsing, bookmarking, adding to cart, and purchasing. Dataset 2 link: Dataset 2 link.
Dataset 3: This dataset is sourced from the Kaggle website. Because many users in this dataset interact with only a few items, this study extracted a subset of valid data so that the training and testing sets could be partitioned appropriately. The subset comprises 7,023 users and 40,022 unique items, with 191,711 user-item interaction records. The data sparsity is 99.965%. The interactions between users and items in this dataset include browsing, adding to cart, and purchasing. Dataset 3 link: Dataset 3 link.

Metric
This article employs classic indicators such as F1 score, Hit Rate (HR), and Normalized Discounted Cumulative Gain (NDCG) as evaluation criteria [30]. The specific calculation formulas are as follows:
• F1 score
where P and R represent precision and recall, respectively; U represents the set of users; R(u) represents the recommendation list of user u; and T(u) represents the actual purchase list of user u.
• Hit Rate
where U represents the set of users, and hits(u) indicates whether a product purchased by user u appears in the recommendation list: it is 1 if so and 0 otherwise.
• Normalized Discounted Cumulative Gain
where DCG_u and IDCG_u represent the discounted cumulative gain and the ideal discounted cumulative gain of the user's recommended list, respectively; R(u) and T(u) represent the user's recommended list and actual purchase list, respectively; rel_n indicates whether the user purchases the nth product in the recommended list (1 if so, 0 otherwise); and U represents the set of users.
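The three indicators can be computed as in the sketch below. It follows common definitions and makes several assumptions that are not confirmed by this section: precision and recall are pooled over all users before F1 is taken, relevance is binary, the DCG discount is 1 / log2(n + 1), and the ideal list places all purchased items first. The function names and the toy lists are hypothetical.

```python
import math


def f1_score(recs, truth):
    """F1 from precision and recall pooled over all users (assumed averaging)."""
    hits = sum(len(set(recs[u]) & set(truth[u])) for u in recs)
    n_rec = sum(len(recs[u]) for u in recs)
    n_true = sum(len(truth[u]) for u in recs)
    p = hits / n_rec if n_rec else 0.0
    r = hits / n_true if n_true else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def hit_rate(recs, truth):
    """Share of users with at least one purchased item in their list."""
    return sum(1 for u in recs if set(recs[u]) & set(truth[u])) / len(recs)


def ndcg(recs, truth):
    """Mean of DCG_u / IDCG_u with binary relevance."""
    total = 0.0
    for u in recs:
        bought = set(truth[u])
        dcg = sum(
            1.0 / math.log2(n + 1)
            for n, item in enumerate(recs[u], start=1)
            if item in bought
        )
        ideal_hits = min(len(bought), len(recs[u]))
        idcg = sum(1.0 / math.log2(n + 1) for n in range(1, ideal_hits + 1))
        total += dcg / idcg if idcg else 0.0
    return total / len(recs)


# R(u): recommended lists; T(u): actual purchases (toy data).
recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
truth = {"u1": ["a", "c"], "u2": ["x"]}
f1, hr, gain = f1_score(recs, truth), hit_rate(recs, truth), ndcg(recs, truth)
```

In the toy data, user u1 has two of three recommendations purchased and user u2 has none, so the hit rate is 0.5 and F1 is 4/9.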

The general framework of the experiment
The hybrid recommendation algorithm proposed in this paper, based on the UNN recommendation and MF, mainly consists of two parts: the UNN model and the MF model.
First, the user-item interaction matrix M_1 is optimized through the UNN model to predict the user's preferences for items that have not been interacted with before, thereby generating a denser user-item interaction matrix M_2. Then, the collaborative filtering model is trained on the optimized user-item interaction matrix M_2 to mitigate the negative impact of matrix sparsity on the collaborative filtering model. Finally, recommendations are generated based on the latent feature vectors of users and items. The specific workflow is illustrated in Fig. 2.
The similarity threshold µ serves as an independent variable in the UNN model, controlling the prediction accuracy within the model. Generally, the quality of the user-item interaction matrix M_2 increases as the similarity threshold µ decreases, reaching a peak before gradually declining.
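The role of the threshold µ in turning M_1 into M_2 can be sketched as follows. The sketch assumes that each user's nearest neighbor and similarity have already been computed, that a prediction is filled in only when the similarity is at least µ, and that the filled value is the similarity times the neighbor's frequency. These rules, the function names, and the toy data are illustrative assumptions. The report mirrors the quantities DS, FV, and FVR that this paper tabulates for each threshold.

```python
def fill_matrix(m1, neighbors, mu):
    """Build M_2 from M_1 using only neighbors whose similarity is >= mu."""
    m2 = {u: dict(items) for u, items in m1.items()}
    filled = 0
    for u, (v, s) in neighbors.items():
        if s < mu:
            continue  # neighbor not similar enough: fill nothing for u
        for i, r in m1[v].items():
            if i not in m1[u]:
                m2[u][i] = s * r  # assumed prediction rule
                filled += 1
    return m2, filled


def density_report(m1, filled, n_users, n_items):
    """DS: sparsity after filling; FV: filled values; FVR: FV / original count."""
    original = sum(len(items) for items in m1.values())
    return {
        "DS": 1.0 - (original + filled) / (n_users * n_items),
        "FV": filled,
        "FVR": filled / original,
    }


# Hypothetical inputs: precomputed (nearest neighbor, similarity) per user.
m1 = {"u": {"a": 2}, "v": {"a": 2, "b": 4}, "w": {"c": 1}}
neighbors = {"u": ("v", 0.9), "w": ("u", 0.1)}
m2_strict, fv_strict = fill_matrix(m1, neighbors, mu=0.5)
m2_loose, fv_loose = fill_matrix(m1, neighbors, mu=0.05)
report = density_report(m1, fv_strict, n_users=3, n_items=3)
```

Lowering µ from 0.5 to 0.05 in the toy example doubles the number of filled values, but the extra value comes from a neighbor with similarity 0.1, which illustrates the trade-off between density and accuracy.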

Experimental results and analysis
Next, we will conduct experiments using three datasets on the Spark platform to verify the recommendation quality of our proposed hybrid recommendation algorithm based on the UNN model under different similarity thresholds and compare it with other collaborative filtering recommendation algorithms.
In the ALS algorithm, the dimension of the latent feature vector ranges from 32 to 80. In the MLP algorithm, the number of neurons in the embedding layer ranges from 32 to 80, there are two hidden layers with 32 and 64 neurons respectively, and the output layer has 100 neurons. In the NCF algorithm, the embedding dimension ranges from 32 to 80, and the neural network structure is the same as in the MLP algorithm.
In the first part of the experiment, we discuss in detail the effectiveness of the UNN model under different similarity thresholds and select an optimal similarity threshold for each dataset. Figures 3, 4, 5 and Tables 2, 3, 4 present some experimental data.
As observed from the above charts, the hybrid recommendation algorithm we proposed exhibits significant sensitivity to changes in the similarity threshold.The characteristics of this sensitivity are as follows: When the similarity threshold is set low, the UNN model generates a large amount of predicted interaction data.This data can be utilized as filler data to enrich the user-item interaction matrix and alleviate its sparsity.However, due to the less accurate prediction data generated by the UNN model at this similarity threshold, the negative impact of reduced data accuracy surpasses the positive impact of increased data density.Consequently, the entire algorithm performs poorly on some indicators.
As the similarity threshold increases, the amount of predicted interaction data generated by the UNN model decreases, while the accuracy increases.At this stage, the positive impact of increased data accuracy outweighs the negative impact of reduced data density.Therefore, all indicators of the entire algorithm show an upward trend.
When the similarity threshold reaches a certain value, the amount of predicted interaction data generated by the UNN model decreases further and its accuracy increases further, but the two effects essentially reach a balance: the positive impact of the latter and the negative impact of the former offset each other. Consequently, the recommendation quality of the entire algorithm fluctuates around a stable level. Eventually, as the amount of predicted interaction data generated by the UNN model approaches zero, the performance of the hybrid algorithm across the various indicators converges to that of a single algorithm.
Table 5 gives the basic information of the user-product interaction matrix based on these three datasets at different similarity thresholds. Here, DS represents data sparsity, FV represents the amount of data filled in the user-item interaction matrix by the UNN model, and FVR represents the percentage of FV in the original data volume.
In the second part of the experiment, building upon the results from the discussion on the similarity threshold, we examine the recommendation quality of the hybrid algorithm and the single algorithm under varying implicit feature vector dimensions. This aims to validate the stability of the hybrid algorithm. The specific details are presented in Figs. 6, 7, 8, while detailed experimental data are provided in Tables 6, 7, 8.
As observed from these figures, by selecting an appropriate similarity threshold, the UNN model increases the data density of the user-item interaction matrix while preserving its accuracy to the greatest extent. Consequently, the hybrid algorithm combined with UNN demonstrates high stability and generally outperforms a single algorithm across the tested scenarios. Additionally, to assess the statistical significance of the hybrid algorithm's performance, we calculated p-values using the t-test method on the experimental results of the hybrid algorithm under different latent factors, as presented in Table 9.
Based on the analysis of the above experimental data, the hybrid recommendation algorithm we proposed combines the UNN model and the collaborative filtering method. By adjusting the similarity threshold, it optimizes the user-item interaction matrix and mitigates matrix sparsity to improve the quality of recommendations. Furthermore, the t-test results indicate that, in most cases, the difference in performance between the hybrid algorithm and a single algorithm is statistically significant at the 95% confidence level.
However, the effectiveness of this algorithm largely hinges on the optimization achieved by the UNN model on the user-item interaction matrix. The UNN model in this paper relies on user similarity alone, which may lead to insufficient optimization of the user-item interaction matrix. Therefore, future research could explore more efficient and accurate algorithms to enhance the UNN model.
For instance, integrating graph neural networks (GNNs) 31 could help capture implicit relationships between users, while combining multiple information sources such as user behavior data, social network information, and product content for feature fusion could improve the accuracy of user similarity calculations under sparse data conditions.This approach would effectively enhance the optimization effect of the UNN model on the user-item interaction matrix.

Conclusion
Collaborative filtering remains a crucial area of research within recommender systems. This study introduces an approach that optimizes the user-item interaction matrix using the user nearest neighbor (UNN) model, which refines the training matrix for collaborative filtering algorithms. Additionally, we integrate the UNN model with multiple recommendation algorithms to evaluate the effectiveness of our hybrid recommendation approach. Experimental results indicate that this strategy significantly reduces the negative effects of matrix sparsity on collaborative filtering algorithms. Furthermore, leveraging a distributed platform enables efficient processing of large-scale matrices, thereby enhancing model training efficiency.
The recommendation quality of the hybrid recommendation algorithm proposed in this paper relies heavily on how well the UNN model optimizes the user-item interaction matrix. Moreover, the accuracy of user similarity measurement is crucial for the efficacy of the UNN model. In scenarios where user-item interactions are sparse, the precision of user similarity measurement may decrease, thereby affecting the overall recommendation quality of the algorithm. Furthermore, when dealing with larger datasets, enhancing the hybrid recommendation algorithm's capability to mitigate data sparsity becomes imperative. Future research could focus on improving user similarity measurement accuracy and enhancing the algorithm's scalability to address these challenges.
Lv 1 , Jiabin Wang 1* , Fan Deng 1,2 & Penggui Yan 1,2

Figure 1. Schematic diagram of the NCF model framework [16].

Figure 2. General framework diagram of the algorithm.

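Figure 3. Schematic diagram of the effectiveness of the UNN model based on different similarity thresholds in Dataset 1.

Figure 4. Schematic diagram of the effectiveness of the UNN model based on different similarity thresholds in Dataset 2.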

Figure 5. Schematic diagram of the effectiveness of the UNN model based on different similarity thresholds in Dataset 3.

Figure 6. Schematic diagram of the effectiveness of the UNN model based on different latent factors in Dataset 1.

Figure 7. Schematic diagram of the effectiveness of the UNN model based on different latent factors in Dataset 2.

Figure 8. Schematic diagram of the effectiveness of the UNN model based on different latent factors in Dataset 3.

Table 2. Effectiveness of UNN models based on different similarity thresholds in Dataset 1. Significant values are in bold.

Table 3. Effectiveness of UNN models based on different similarity thresholds in Dataset 2. Significant values are in bold.

Table 4. Effectiveness of UNN models based on different similarity thresholds in Dataset 3. Significant values are in bold.

Table 5. Basic information table of user-item interaction matrix under different similarity thresholds.

Table 6. Effectiveness of UNN models based on different latent factors in Dataset 1.

Table 7. Effectiveness of UNN models based on different latent factors in Dataset 2.

Table 8. Effectiveness of UNN models based on different latent factors in Dataset 3.

Table 9. T-test experimental data table of the hybrid algorithm.